FIGURE 5.11
Loss landscape visualizations of the full-precision, ternary, and binary models on MRPC [230].
where $x \in \{\pm 0.2\bar{W}_1, \pm 0.4\bar{W}_1, \ldots, \pm 1.0\bar{W}_1\}$ are perturbation magnitudes based on the absolute mean value $\bar{W}_1$ of $\mathbf{W}_1$, and similar rules hold for $y$; $\mathbf{1}_x$ and $\mathbf{1}_y$ are vectors with all elements equal to 1. For each pair $(x, y)$, the corresponding training loss is shown in Fig. 5.11. As can be seen, the full-precision model has the lowest overall training loss, and its loss landscape is flat and robust to perturbation. For the ternary model, although the surface tilts up under larger perturbations, it remains locally convex and is thus easy to optimize. This may also explain why BERT models can be ternarized without a severe accuracy drop [285]. The loss landscape of the binary model, however, is both higher and more complex. When the three landscapes are stacked together, the loss surface of the binary BERT sits above the other two by a clear margin. The steep curvature of the loss surface reflects a higher sensitivity to binarization, which contributes to the training difficulty.
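To make the procedure concrete, the sketch below evaluates a training loss over such a grid of paired perturbations in PyTorch. It is a minimal illustration, not the authors' code: `model`, `compute_loss`, `batch`, and the choice of which two weight tensors to perturb are hypothetical placeholders.

```python
import torch

def loss_landscape(model, compute_loss, batch, w1, w2):
    """Training loss over a grid of (x, y) shifts applied to w1 and w2.

    Each grid point perturbs the weights as W1 + x * 1 and W2 + y * 1,
    with x and y taken as multiples of the absolute mean of each tensor,
    mirroring the perturbation scheme described above.
    """
    scales = torch.tensor([-1.0, -0.8, -0.6, -0.4, -0.2,
                            0.2,  0.4,  0.6,  0.8,  1.0])
    xs = scales * w1.detach().abs().mean()  # +/- 0.2*W1_bar, ..., +/- 1.0*W1_bar
    ys = scales * w2.detach().abs().mean()

    w1_orig, w2_orig = w1.detach().clone(), w2.detach().clone()
    losses = torch.zeros(len(xs), len(ys))
    with torch.no_grad():
        for i, x in enumerate(xs):
            for j, y in enumerate(ys):
                w1.copy_(w1_orig + x)          # W1 + x * 1_x
                w2.copy_(w2_orig + y)          # W2 + y * 1_y
                losses[i, j] = compute_loss(model, batch)
        w1.copy_(w1_orig)                      # restore the original weights
        w2.copy_(w2_orig)
    return xs, ys, losses
```

Plotting `losses` as a surface over `xs` and `ys` reproduces the kind of landscape shown in Fig. 5.11.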
The authors further quantitatively measured the steepness of the loss landscape, starting from a local minimum $\mathbf{W}$ and applying a second-order approximation to the curvature. By Taylor's expansion, the loss increase induced by quantizing $\mathbf{W}$ can be approximately upper bounded by
$$\ell(\hat{\mathbf{W}}) - \ell(\mathbf{W}) \approx \epsilon^\top \mathbf{H}\,\epsilon \le \lambda_{\max} \|\epsilon\|^2, \qquad (5.19)$$
where ϵ = W −ˆ
W is the quantization noise, and λmax is the largest eigenvalue of the
Hessian H at w. Note that the first-order term is skipped due to ∇ℓ(W) = 0. By taking
λmax [208] as a quantitative measurement for the steepness of the loss surface, the authors
separately calculated λmax for each part of BERT as (1) the query/key layers (MHA-QK),
(2) the value layer (MHA-V), (3) the output projection layer (MHA-O) in the multi-head
attention, (4) the intermediate layer (FFN-Mid), and (5) the output layer (FFN-Out) in the
feed-forward network. From Fig. 5.12, the top-1 eigenvalues of the binary model are higher
FIGURE 5.12
The top-1 eigenvalues of the parameters in different Transformer parts of the full-precision (FP), ternary, and binary BERT.
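For readers who want to reproduce this kind of measurement, $\lambda_{\max}$ can be estimated without materializing the Hessian by running power iteration on Hessian-vector products. The PyTorch sketch below is a minimal illustration of that idea, not the authors' implementation (which follows [208]); `loss` is assumed to be a scalar loss computed on a training batch, and `params` the weight tensors of one Transformer part.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate lambda_max of the Hessian of `loss` w.r.t. `params`
    via power iteration with Hessian-vector products."""
    # First-order gradients, kept in the graph so they can be
    # differentiated again to form Hessian-vector products.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting vector (one block per parameter tensor).
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: Hv = d(g . v)/dW, with v held fixed.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v (with ||v|| = 1) estimates lambda_max.
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]   # next power-iteration direction
    return eig
```

Comparing the returned estimates across the full-precision, ternary, and binary models would reproduce the qualitative ordering in Fig. 5.12, with the binary model's steeper surface yielding the largest top eigenvalues.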